Annotation Interoperability Christian Chiarcos [email protected] EUROLAN-2015, 2015, July 21, Sibiu, Romania

Aktuelle Themen der Angewandten Informatik Semantische ...acoli.informatik.uni-frankfurt.de/resources/llod/session1... · Annotation Interoperability Ontologies of Linguistic Annotation



Annotation Interoperability

Tue, July 21st, 09:00-10:30

Annotation Interoperability I

Ontologies of Linguistic Annotation – Motivations and Principles

Wed, July 22nd, 11:00-12:30

Annotation Interoperability II

Applications and Use Cases

Wed, July 22nd, 14:00-15:30

Annotation Interoperability III

Hands-on session


Annotation Interoperability

Ontologies of Linguistic Annotation – Motivations and Principles

1. Conceptual Interoperability

2. Towards a modular set of linked ontologies

3. Structure and history of OLiA ontologies

4. Use case I: Documentation and formalization

5. A closer look at an example: MULTEXT-East

6. Use case II: Cross-tagset search via query rewriting


Before we proceed …

… please download and install Protégé 5.0 (Desktop version) over the day

http://protege.stanford.edu/

for the hands-on session tomorrow

• we'll be building annotation models and linking them


Conceptual Interoperability

Problem and earlier approaches


Interoperability

• Language resources (tools, corpora, dictionaries) are interoperable if we can (re-) use them as components of a coherent system/resource

• e.g., tools

– using tagger A with parser B

• with a domain-adapted tagger A,* and a general-purpose parser B

* think of POS taggers for the biomedical domain (e.g., Genia) which use different tokenization strategies than out-of-the-box parsers


• e.g., corpora

– run the same query on corpus A and corpus B

=> more data, more likely significant results, comparable results


• e.g., dictionary + tool/corpus

– use dictionary A as a component for tagger B

• if grammatical categories correspond to tags in the tagset that the tagger is trained on


Dimensions of Interoperability

• Structural Interoperability

– use the same format / mode of access

– more on this tomorrow

• Conceptual Interoperability

– use the same vocabularies, e.g., for linguistic annotations

– for the moment, we focus on the most elementary level: morphosyntax

(parts-of-speech, agreement features)


Interoperability Issues: Monolingual

• When language resources for a low-resource language are developed, different people have different ideas, e.g., for English (by the mid-1990s)

Word      Susanne   Penn
The       AT        DT
Fulton    NP1s      NNP
County    NNL1cb    NNP
Grand     JJ        NNP
Jury      NN1c      NNP
said      VVDv      VBD
Friday    NPD1      NNP

• Susanne: 395 tags (word classes, morphological features, syntactic features, lexical classes)

• Penn: 57 tags (word classes, number and degree)

Interoperability Issues: Monolingual

• Integrating both resources allows us to

– apply more wide-scale statistical analyses

– increase training data for supervised POS tagging

– increase test data for unsupervised POS tagging
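The pooling idea can be sketched in a few lines of Python. This is only an illustration: the two mapping tables and the coarse category names (Determiner, Noun, Adjective, Verb) are hand-written assumptions for the tags of the example sentence, not an official mapping of either tagset.

```python
from collections import Counter

# Assumed, hand-written mappings to coarse shared categories.
# Only the tags of the example sentence are covered.
SUSANNE_TO_COARSE = {
    "AT": "Determiner", "NP1s": "Noun", "NNL1cb": "Noun",
    "JJ": "Adjective", "NN1c": "Noun", "VVDv": "Verb", "NPD1": "Noun",
}
PENN_TO_COARSE = {"DT": "Determiner", "NNP": "Noun", "VBD": "Verb"}

# (word, Susanne tag, Penn tag), taken from the example sentence.
SENTENCE = [
    ("The", "AT", "DT"), ("Fulton", "NP1s", "NNP"), ("County", "NNL1cb", "NNP"),
    ("Grand", "JJ", "NNP"), ("Jury", "NN1c", "NNP"), ("said", "VVDv", "VBD"),
    ("Friday", "NPD1", "NNP"),
]

def pooled_counts(rows):
    """Count coarse categories over both annotations, as if they were one data set."""
    counts = Counter()
    for _word, susanne_tag, penn_tag in rows:
        counts[SUSANNE_TO_COARSE[susanne_tag]] += 1
        counts[PENN_TO_COARSE[penn_tag]] += 1
    return counts

def disagreements(rows):
    """Words whose two annotations end up in different coarse categories."""
    return [word for word, s, p in rows
            if SUSANNE_TO_COARSE[s] != PENN_TO_COARSE[p]]

if __name__ == "__main__":
    print(pooled_counts(SENTENCE))
    print(disagreements(SENTENCE))
```

Even with such a coarse mapping, the two annotations of "Grand" do not end up in the same category, which is the kind of mismatch a shared vocabulary has to make explicit.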


Interoperability Issues: Multilingual

• with interoperable POS tags used across different languages, …

– we can apply the same unlexicalized NLP tools (e.g., parsers, cf. McDonald et al. 2013)

– we can perform comparative corpus studies

– we simplify multilingual annotation projection


Violations of Interoperability

• ROSANA Anaphor Resolution (Stuckardt 2001)

– required Connexor parser

• a commercial product

• UIMA annotation type systems

– NLP modules using the same annotation types are interoperable, but different groups develop their own, even for the same tools for the same language


Classical solution: Standardization

• Expert Advisory Group on Language Engineering Standards (EAGLES)*

– European standardization project (1993 – 1996)

– further elaborated by MULTEXT-East and ISLE/Parole

• Recommendations for POS tag sets

– derived in a bottom-up manner

– no theoretical specification of tag sets, only identification of commonly used terms

* http://www.ilc.cnr.it/EAGLES96/home.html

Issues with EAGLES

… although linguists agree on the general ”common-sense” definitions of categories like proper noun, common noun etc, our analysis of competing tagsets for English corpora shows that these categories are in fact ‘fuzzy’, and different corpus tagging projects have adopted subtly but significantly different definitions, probably unaware that their analyses are incompatible with those of other linguists …

(Hughes et al. 1995)

* EAGLES is a classical case; our generation is just about to re-invent this wheel with "Universal Dependencies" (http://universaldependencies.github.io/)

Issues with EAGLES, cont'ed

• Certain phenomena are hard to group with "major" categories

– quantifier (all, some, many)

• (pre-)determiner? (all the books)

• pronoun? (all of them)

• number? (all books ~ 25 books)

• adjective? (all books ~ green books), suggested for inflecting languages

– attributive pronouns (his book)

• pronoun?

• determiner?

– adjectival participles (enduring freedom)

• verb?

• adjective?

Issues with EAGLES, cont'ed

• Certain phenomena are hard to group with "major" categories

• because "morphosyntax/parts-of-speech" conflates different levels of description, e.g.,

– syntax vs. semantics

• attributive pronoun => determiner vs. pronoun

– syntax vs. morphology

• adjectival participles => adjective or verb

– morphology vs. semantics

• ordinal numbers => adjectives vs. numerals

– homophony vs. linguistically defined categories

• VH for auxiliary have, but also for have as a main verb

Issues with EAGLES, cont'ed

• Certain phenomena are hard to group with "major" categories

• BUT

– standardization towards a meta-tagset implicitly enforces unambiguous classification

• taxonomical/tree-like structure

=> independent decisions by tagset designers lead to incompatibilities, e.g., AUX as "auxiliary verb" vs. "potential auxiliary verb"

Issues with EAGLES, cont'ed

• EAGLES is prescriptive

– a standard-conformant tagset needs to provide certain categories

• even if not relevant to a language

• e.g.

– Determiner (lacking for most Slavic languages)

– Adjective (lacking for Chinese)

– Noun-Verb distinction (debated for Fijian and Inuktitut)

=> EAGLES is specific to Western European languages

Issues with EAGLES, cont'ed

• EAGLES is built in a bottom-up fashion

– if unknown phenomena for novel languages are encountered, they are added as optional (language-specific) features

– existing features may or may not be re-used

• later: some problematic cases from MULTEXT-East


Issues with EAGLES, cont'ed

• EAGLES requires a 1:1-mapping* from standard-conformant tagsets

– every language-specific tag is mapped to exactly one EAGLES tag, so that they are equivalent

– given the definitional problems mentioned before, we'd like to express whether a mapping is perfect or imprecise

• or indicate partial overlaps with standard categories

* in fact, tags can be underspecified, so it is a 1:m mapping

Issues with EAGLES, cont'ed

• EAGLES provides a fixed level of granularity

– more fine-grained categories are abandoned, e.g., semantic classes in Susanne

• for reasons of practicality, this level of granularity is not the finest possible

=> reductionism

=> many shortcomings of the standardization approach can be addressed by modelling linguistic reference terminology by means of ontologies

Towards an ontology of linguistic terminology

• Goal

– develop and apply an ontology as a terminological backbone of different kinds of linguistic annotation

• Use cases

– overcome differences in task-, domain- or language-specific annotations

– provide a unified access to terminologically heterogeneously analysed

Towards an ontology of linguistic terminology

• Ontology

– conceptualization of a certain domain

• e.g. a taxonomy of linguistic terms

– hierarchically and relationally structured

• OWL2/DL (Web Ontology Language)*

– formal description language for ontologies

– formalizes description logics

• conceptual subsumption (rdfs:subClassOf)

• logical operators (incl. disjunction and negation)

* Web Ontology Language, http://www.w3.org/TR/owl2-overview/
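To make subsumption and disjunction concrete, here is a minimal Python sketch. It is a toy model, not OWL reasoning: the hierarchy is a plain dictionary, the class names follow the Reference Model excerpt shown later in these slides, the class Pronoun is an assumed name, and negation is left out.

```python
# Toy class hierarchy: child -> set of direct superclasses (rdfs:subClassOf).
# "Pronoun" is an assumed class name used only for illustration.
SUBCLASS_OF = {
    "DemonstrativeDeterminer": {"Determiner"},
    "Determiner": {"PronounOrDeterminer"},
    "Pronoun": {"PronounOrDeterminer"},
    "PronounOrDeterminer": {"MorphosyntacticCategory"},
}

def superclasses(concept):
    """All direct and indirect superclasses of a concept (transitive closure)."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_subsumed_by(concept, ancestor):
    """True if concept equals ancestor or is a (transitive) subclass of it."""
    return concept == ancestor or ancestor in superclasses(concept)

def is_subsumed_by_union(concept, alternatives):
    """Simplified disjunction: subsumed by at least one of the alternatives."""
    return any(is_subsumed_by(concept, alt) for alt in alternatives)

if __name__ == "__main__":
    print(is_subsumed_by("DemonstrativeDeterminer", "MorphosyntacticCategory"))
    print(is_subsumed_by_union("DemonstrativeDeterminer", {"Pronoun", "Determiner"}))
```

A description logic reasoner does considerably more than this (for example, it can derive subsumption by a union without either disjunct applying on its own), so the sketch only conveys the basic intuition.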


Towards an ontology of linguistic terminology

• Against multiple tag sets

– unified representation of heterogeneous data

• linked to multiple different tag sets

– transparent

• abstraction from tag set specifics

– formal definitions

• based on description logics


Towards an ontology of linguistic terminology

• Against standardisation

– different conceptualizations

• language-specific traditions

• domain-specific conceptualizations

– different granularity

– implicit interpretation

• when mapping annotations to standard terms


Towards a modular set of linked ontologies

Just using a central ontology isn't enough

A joint, extensible terminology repository?

• Differences ... among different language resources and individual system objectives ... lead to variations in data category definitions and data category names.

• The use of uniform data category names and definitions ... contributes to system coherence and enhances the re-usability of data.

(Ide & Romary 2004)


The solution I

General Ontology of Linguistic Description (GOLD)

– ... large amounts of linguistic data on the Web ... from different languages can be automatically searched and compared ...

– ... the data and the various encoding schemes in which they are represented need an explicit semantics.

– ... a data model ... which is consistent with .... the Semantic Web ...

(Farrar & Langendoen 2003)

http://linguistics-ontology.org/gold

The solution II

ISO TC37/SC4 Data Category Registry (ISOcat)

– ... a family of data category standards designed to meet the needs of terminologists and other language experts developing a variety of electronic linguistic resources. ...

– ... to ensure interoperability among these domains ...

– ... with an eye to facilitating ... wide-scale information handling environments such as the Semantic Web ...

(Wright 2004)

http://isocat.org/

• The RELISH project aimed to harmonize GOLD and ISOcat, and brought GOLD-2010 into ISOcat

– unfortunately, this only increased redundancy: 5 types of CommonNouns alongside each other

– RelCat has not materialized yet

https://tla.mpi.nl/relish/

The solution III-VIII

Documentation standards in typology

– EUROTYP (Bakker et al. 1993)

– AUTOTYP (Bickel & Nichols 2002)

– Typological Database System (TDS) ontology (Dimitriadis et al. 2009)

Standardization initiatives and multi-language tagsets

– EAGLES (Leech & Wilson 1996)

– MULTEXT-East (Erjavec 2010)

– Common POS tagset for Indian languages (Baskaran et al. 2008)

– Universal POS tags / Universal Dependencies (Petrov et al. 2012)

Another Problem

Imagine you plan to develop a tool that makes use of a terminology repository.

Which one would you choose?

Maybe it's not even your choice ...

... your clients may have their own preferences ... and different clients may have different preferences

Another Solution

Modular architecture

– Instead of limiting ourselves to one, we may make use of an intermediate representation that links to all of them

– If we want to avoid losing information by replacing annotations with reference categories, the original annotation scheme should be formalized as well

Ontologies of Linguistic Annotation (OLiA) http://purl.org/olia

Structure of OLiA ontologies

and their relation to other terminology repositories


Ontologies of Linguistic Annotation

modular OWL/DL ontologies

– Annotation Models

• annotation scheme

– OLiA Reference Model

• common terminology

– External Reference Models

• existing terminology repositories

OLiA Reference Model

– interface between annotations and (multiple) terminology repositories

[Figure: architecture diagram with the components OLiA Reference Model, Annotation Models, and Terminology Repositories]

OLiA Reference Model

• harmonization of repositories of annotation terminology

• morphosyntax & morphology

– 39 schemes

– ~70 languages*

• syntax, discourse structure, anaphora, information structure

* including multilingual annotation schemes: Tapanainen & Järvinen (1997), Dipper et al. (2007), Erjavec (2010)

OLiA Reference Model

[Figure: excerpt of the OLiA Reference Model]

concepts

– Morphosyntactic Category, with sub-concepts (is-a) such as PronounOrDeterminer, Determiner, Demonstrative Determiner, ...

– Morphological Feature, with sub-concepts (is-a) such as Case, Accusative Case, ...

properties

– hasCase, relating x : MorphosyntacticCategory to y : Case

OLiA Annotation Models

• OWL/DL formalizations of annotation schemes

– structure similar to the Reference Model

• individuals represent annotation values

– hasTag property

• string value of annotation


OLiA Annotation Model

[Figure: STTS Annotation Model (STTS: German part-of-speech tags)]

– Adjective is-a POS

– ADJD and ADJA are instances of Adjective

– ADJA: hasTag "ADJA"

OLiA Linking Model

Annotation model concepts are defined as subclasses of Reference Model concepts

– properties as sub-properties

– individuals as instances

The linking is physically separated from the models

– one possible interpretation of Annotation Model concepts in terms of the Reference Model


OLiA Linking Model

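[Figure: STTS Linking between the STTS Annotation Model and the OLiA Reference Model]

– STTS Annotation Model: Adjective is-a POS; ADJD and ADJA are instances of Adjective; ADJA: hasTag "ADJA"

– OLiA Reference Model: Attributive Adjective is-a Adjective; Adjective is-a Morphosyntactic Category

– STTS Linking: the STTS concept Adjective is-a the Reference Model concept Adjective; the individual ADJA is an instance of Attributive Adjective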
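How a tag string leads to Reference Model concepts can be sketched in Python as follows. The dictionaries are a hand-written toy version of the STTS example; the prefixes stts: and olia: are shorthand used only here, and the hasTag value for ADJD is assumed by analogy with ADJA.

```python
# Toy versions of the three models; "stts:" / "olia:" are illustrative prefixes.
SUBCLASS_OF = {
    "stts:Adjective": {"stts:POS"},                      # Annotation Model
    "olia:AttributiveAdjective": {"olia:Adjective"},     # Reference Model
    "olia:Adjective": {"olia:MorphosyntacticCategory"},  # Reference Model
}
INSTANCE_OF = {"stts:ADJA": {"stts:Adjective"}, "stts:ADJD": {"stts:Adjective"}}
HAS_TAG = {"stts:ADJA": "ADJA", "stts:ADJD": "ADJD"}  # ADJD value is an assumption

# Linking Model, kept apart from the other models as on the slide.
LINKING_SUBCLASS_OF = {"stts:Adjective": {"olia:Adjective"}}
LINKING_INSTANCE_OF = {"stts:ADJA": {"olia:AttributiveAdjective"}}

def reference_concepts(tag):
    """Reference Model concepts that can be inferred for a tag string."""
    frontier = []
    for individual, value in HAS_TAG.items():
        if value == tag:
            frontier.extend(INSTANCE_OF.get(individual, ()))
            frontier.extend(LINKING_INSTANCE_OF.get(individual, ()))
    seen = set()
    while frontier:
        concept = frontier.pop()
        if concept in seen:
            continue
        seen.add(concept)
        frontier.extend(SUBCLASS_OF.get(concept, ()))
        frontier.extend(LINKING_SUBCLASS_OF.get(concept, ()))
    # Keep only concepts from the Reference Model.
    return {c for c in seen if c.startswith("olia:")}

if __name__ == "__main__":
    print(sorted(reference_concepts("ADJA")))
    print(sorted(reference_concepts("ADJD")))
```

Because the linking lives in its own dictionaries, swapping in a different interpretation of the tagset would not touch the Annotation Model or the Reference Model, which mirrors the physical separation described above.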


OLiA: Terminology Repositories


The OLiA Reference Model is further linked to terminology repositories

– if they are modelled in OWL/DL

• GOLD (Chiarcos 2008)

• ISOcat (Chiarcos 2010)

• OntoTag (Buyko et al. 2008)

• TDS (Dimitriadis et al. 2009)


Extensibility

• The OLiA Reference Model provides only one possible view on linguistic terminology; adaptations for other communities are encouraged

External Reference Model / Terminology Repository

– its concepts are superclasses of OLiA Reference Model concepts

• OLiA can be seen as a GOLD Community of Practice Extension

OLiA serving as interface to different tagsets

– only one mapping needs to be defined

How to access OLiA ontologies

• modular structure

– every model is an independent ontology in a separate file

=> different name spaces

• declarative linking

– linking model in a separate file

• stts-link.rdf

• to use OLiA, directly import the linking model (and its imports)

[Figure: stts-link.rdf (Linking Model) connects stts.owl (Annotation Model STTS) with olia.owl (OLiA Reference Model)]

File Structure

[Figure: file structure]

– olia.owl: OLiA Reference Model

– stts.owl: Annotation Model STTS, linked via stts-link.rdf

– susa.owl: Annotation Model Susanne, linked via susa-link.rdf

– penn.owl: Annotation Model Penn, linked via penn-link.rdf

– ...

For every Annotation Model, there is at least one Linking Model linking it with the OLiA Reference Model

How to access multiple OLiA ontologies

[Figure: master file all.rdf on top of stts-link.rdf, susa-link.rdf, penn-link.rdf, ..., which link stts.owl, susa.owl, penn.owl, ... with olia.owl]

Create a master file which imports the Linking Models with their imports
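A master file of this kind can be generated with a short Python sketch using only the standard library. The base URL http://example.org/olia/ is a placeholder and not the real location of the OLiA files; the file names are those from the slides, and the function name build_master is made up for this example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"

def build_master(base, linking_files):
    """Return RDF/XML for an ontology that owl:imports every linking model."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("owl", OWL)
    root = ET.Element(f"{{{RDF}}}RDF")
    ontology = ET.SubElement(
        root, f"{{{OWL}}}Ontology", {f"{{{RDF}}}about": base + "all.rdf"}
    )
    for name in linking_files:
        # Each linking model in turn imports its Annotation Model and olia.owl.
        ET.SubElement(
            ontology, f"{{{OWL}}}imports", {f"{{{RDF}}}resource": base + name}
        )
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # Placeholder base URL; replace with the actual location of the files.
    print(build_master(
        "http://example.org/olia/",
        ["stts-link.rdf", "susa-link.rdf", "penn-link.rdf"],
    ))
```

Loading the resulting file in an ontology editor such as Protégé should then pull in the linking models and, through their own imports, the Annotation Models and the Reference Model, provided the URLs resolve.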


How to access external terminology repositories

Analogously, external reference models (terminology repositories) can be included

[Figure: file structure with external repository]
• olia.owl: OLiA Reference Model
• gold.owl: Terminology Repository (e.g., GOLD), with Linking Model gold-link.rdf
• all.rdf: Master file

For querying (etc.), one can access external conceptual models

=> simplifications with SPARQL Update


Inferring Conceptual Descriptions

[Figure: Annotation Models, OLiA Reference Model, Terminology Repositories]

Further analogous inference of GOLD or ISOcat concepts

=> interoperable with both repositories


Terminology

• Translation from tags to ontological descriptions (triple sets)

– comparable representations for annotations of different origin

mapping between tagsets

concept-based ensemble combination architecture

concept-based corpus querying

more in a minute


A brief history of OLiA


Original research context

"Sustainability of Linguistic Data" (2005-2008)

cooperation project between three German collaborative research centers

• CRC441 "Linguistic data structures" (Tübingen)
• CRC538 "Multilingualism" (Hamburg)
• CRC632 "Information Structure" (Potsdam/Berlin)

data collections of research projects should be kept available for later research activities


Original use case

• Motivation

– structural differences between different annotations/analyses

• hindering interoperability between concurrent taggers/tag sets

reference to a common terminological backbone


• Goal
– develop and apply an ontology as a terminological backbone of different kinds of linguistic annotation

• Use cases
– overcome differences in task-, domain- or language-specific annotations
– provide a unified access to terminologically heterogeneously analysed


Developing an ontology: Procedure

• derive a taxonomy of word classes from EAGLES
=> "EAGLES ontology"

• augment with categories from other tag sets
=> "E(xtended)-EAGLES ontology"

• harmonize E-EAGLES ontology with GOLD
– enrichment of structures
– possible revisions of GOLD
=> "E-GOLD ontology"


Developing an ontology: The EAGLES ontology (2005)

• hierarchical interpretation of EAGLES meta tags
– word classes
• noun, verb, adjective, ...
=> top level categories
– recommended features
• common noun vs. proper noun
=> subclasses
– purely inflectional features ignored
• case, definiteness of nouns, mood, etc.


[Figure: Verb with the subclasses FiniteVerb and NonFiniteVerb; NonFiniteVerb with the subclasses Infinitive and Participle; legend: subclass, disjoint]


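Developing an ontology: The extended EAGLES ontology (2005)

[Figure: Verb with the subclasses FiniteVerb and NonFiniteVerb; NonFiniteVerb with the subclasses Infinitive, Participle and AdverbialParticiple; AdverbialParticiple covers the "transgressive" of the CRC441/B1 tagset; legend: subclass, disjoint]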


Developing an ontology: E-GOLD (2006)

• use GOLD as a reference ontology
– Number as a sub-class of Quantifier

• suggested additions to GOLD
– CommonNoun vs. ProperNoun

• suggest revisions of GOLD
– She's the one.
• Number ⊑ Quantifier ⊑ Determiner ???
=> Quantifier as top-level category


Developing an ontology: OLiA Reference Model

• extended in accordance with further annotation schemes

• extended for syntax (2007) and discourse (2014, experimental)

• linked to OntoTag (2008), ISOcat (2010), MULTEXT-East (2011), TDS (2012), lexinfo (2015)

• to be linked to Universal Dependencies
=> hands-on session tomorrow


Conceptual Interoperability

Penn

The DT: Determiner ⊓ PronounOrDeterminer
Fulton NNP: ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
County NNP: ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Grand NNP: ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Jury NNP: ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
said VBD: (MainVerb ⊔ StrictAuxiliaryVerb) ⊓ Verb ⊓ ∃hasTense.Past [sic!]
Friday NNP: ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular

Susanne

The AT: DefiniteArticle ⊓ Article ⊓ Determiner ⊓ PronounOrDeterminer
Fulton NP1s: Surname ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
County NNL1cb: TopographicalNoun ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Grand JJ: Adjective ⊓ ∃hasDegree.Positive
Jury NN1c: CommonNoun ⊓ Noun ⊓ ∃hasNumber.Singular
said VVDv: MainVerb ⊓ Verb ⊓ ∃hasTense.Past
Friday NPD1: TemporalNoun ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular

=> mostly identical triples, just a few more from Susanne
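The comparison can be made concrete with a small sketch. It encodes a few of the tag descriptions from this slide as plain sets of concept names; this is a simplification, since the real descriptions are OWL class expressions. Existential restrictions are flattened to strings such as "hasNumber.Singular", the disjunction for VBD is left out, and all names in the code are illustrative.

```python
# Tag descriptions from the slide, flattened to sets of conjuncts (assumption:
# a set of names is enough for illustration; OLiA itself uses OWL axioms).
PENN = {
    "DT": {"Determiner", "PronounOrDeterminer"},
    "NNP": {"ProperNoun", "Noun", "hasNumber.Singular"},
}
SUSANNE = {
    "AT": {"DefiniteArticle", "Article", "Determiner", "PronounOrDeterminer"},
    "NP1s": {"Surname", "ProperNoun", "Noun", "hasNumber.Singular"},
    "NNL1cb": {"TopographicalNoun", "ProperNoun", "Noun", "hasNumber.Singular"},
    "NN1c": {"CommonNoun", "Noun", "hasNumber.Singular"},
}


def compare(penn_tag, susanne_tag):
    """Split two tag descriptions into shared and tagset-specific conjuncts."""
    penn, susanne = PENN[penn_tag], SUSANNE[susanne_tag]
    return {
        "shared": penn & susanne,
        "penn_only": penn - susanne,
        "susanne_only": susanne - penn,
    }


for token, penn_tag, susanne_tag in [
    ("The", "DT", "AT"),
    ("Fulton", "NNP", "NP1s"),
    ("Jury", "NNP", "NN1c"),
]:
    print(token, compare(penn_tag, susanne_tag))
```

For "The" and "Fulton" the Penn description is fully contained in the Susanne description; for "Jury" the two tagsets disagree on ProperNoun vs. CommonNoun, which becomes visible as a difference at the concept level.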


Limitations

• The OLiA Reference Model is not fully axiomatized

• This is not possible in a language-independent way

"only Nouns, Pronouns, Determiners and Adjectives have Gender agreement"?

– But what about Slavic verbs in past tense?


"only FiniteVerbs have Tense"?

– past and present participles

– tensed infinitives in Old Norse and Ancient Greek


"Adverbs don't agree"?

– German meinetwegen, deinetwegen, seinetwegen


"Nouns are not finite Verbs"?

– Inuktitut

qimiutuq "(he) has a dog" (v.3s.vp.) = "dog-owner" (n.abs)

qimiutup "he has a dog" (vpart.) = "dog-owner" (n.erg)


• no disjointness and cardinality axioms

– need to be defined in a language-specific way

=> can only be heuristically extrapolated from annotations


Linking the ontology

• applies LOD principles to the relation between tagsets

• used as a vocabulary
– NLP Interchange Format, Apache Stanbol (for linguistic annotations)
– lemon (machine-readable dictionaries)

• linked with bibliographical data
– Virtuelle Fachbibliothek Allgemeine Sprachwissenschaft (2015-2016)


Original Use Case

• Tagset formalization

– formal definitions

– uniformly laid-out HTML, automatically generated from an ontology

• Advanced use cases

– ontology-based corpus querying

– ontology-based NLP applications


A closer look at an example: MULTEXT-East

Chiarcos & Erjavec (2011)


MULTEXT-East

• Corpus and dictionary project (Veronis & Ide 2004, Erjavec 2010)

• Idea: Extend EAGLES to Eastern Europe

• Parallel "1984" corpus plus morphosyntactically annotated dictionaries
– English
– Slavic (Bulgarian, Croatian, Czech, Macedonian, Polish, Resian, Russian, Serbian, Slovak, Slovene, Ukrainian)
– Finno-Ugric (Estonian, Hungarian)
– Romanian
– Persian


Building the MULTEXT-East Ontology

• Annotation guidelines in TEI/XML

• Automatically converted to OWL2/DL using XSLT
– common specifications as TBox
– language-specific as ABoxes importing the TBox

• discussed with MULTEXT-East users and maintainers
– manually revised

• common specifications semiautomatically linked with OLiA Reference Model
– OLiA Reference Model manually extended

http://nl.ijs.si/ME/owl/


MULTEXT-East Morphosyntactic Descriptions

Multiple documents

• common specifications

• language-specific

The common specifications provide all values used in MULTEXT-East corpora/dictionaries (the language-specific documents are similar)


MULTEXT-East Common Specifications

• categories become top-level concepts

• Fine-grained parts of speech are encoded as features (~ EAGLES)

– e.g., Noun, Type=common (Nc)

– converted into sub-concepts

• choice followed OLiA Reference Model

• other features are encoded as object properties plus associated feature concept
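The conversion rules above can be sketched as follows. This is an illustration, not the XSLT actually used: msd_to_description is a hypothetical function, the "has" + attribute naming of object properties is an assumption, and the feature concept names follow the value + attribute pattern of the automatically generated names (e.g., YesDefiniteness for Definiteness=yes).

```python
def msd_to_description(category, features):
    """Turn a category plus attribute-value pairs into a concept and properties.

    category: top-level concept, e.g. "Noun"
    features: dict such as {"Type": "common", "Definiteness": "yes"}
    """
    concept = category
    properties = {}
    for attribute, value in features.items():
        if attribute == "Type":
            # fine-grained part of speech becomes a sub-concept: CommonNoun
            concept = value.capitalize() + category
        else:
            # other features: object property plus associated feature concept
            # (property naming "has<Attribute>" is an assumption)
            properties["has" + attribute] = value.capitalize() + attribute
    return concept, properties


print(msd_to_description("Noun", {"Type": "common", "Definiteness": "yes"}))
```

Names produced this way are purely mechanical, which is why a later manual revision of the generated names is needed.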


MULTEXT-East Language Specifics

• Stored in separate document
– No hierarchical structure inferred, import common specifications

• Add tags as individuals
= instances of concepts and features, with tag value and object properties to itself


Observations

Like EAGLES, MTE uses a positional tagset

bias against adding new attributes

systematic overload (of attributes and values)

A manual revision was thus unavoidable


Manual Revision

• Adjust automatically generated names

CorrelatCoordConjunction < Coord, Type=correlat expanded to CorrelativeCoordinatingConjunction

YesDefiniteness < Definiteness=yes simplified to Definite


• Manual hierarchical reanalysis of (some) feature values

CliticProximalDeterminer ⊑ CliticDefiniteDeterminer

(could be presented as a flat list in MTE only)


Resolving Attribute Overload

• one attribute groups together unrelated phenomena from different languages

• Definiteness =>

– CliticDeterminerType (presence of a postfixed article of Romanian, Bulgarian and Persian nouns and adjectives)

– ReductionFeature (full and reduced adjectives in many Slavic languages)

– PersonOfObject (the so-called ‘definite conjugation’ of Hungarian verbs)


Documenting Value Overload

• one attribute groups together unrelated phenomena from different languages

• Definiteness=yes (=> Definite), i.e.,

– clitic definite determiner (CliticDeterminerType in Romanian and Bulgarian)

– clitic specific determiner (CliticDeterminerType in Persian)

– verb with a definite 3rd-person direct object (PersonOfObject in Hungarian)

Definite ⊑ CliticDefiniteDeterminer ⊔ CliticSpecificDeterminer ⊔ PersonOfObject

• In addition, add concept as "anchor" for such ambiguous features

Definite ⊑ AmbiguousDefinitenessFeature
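How such an overloaded value can be narrowed down once the language is known may be sketched as follows. The language-to-concept table is taken from this slide; the function name and the idea of passing the language as an argument are assumptions made for illustration.

```python
# Readings of Definiteness=yes (Definite) per language, as listed on the slide.
DEFINITE_BY_LANGUAGE = {
    "Romanian": "CliticDefiniteDeterminer",
    "Bulgarian": "CliticDefiniteDeterminer",
    "Persian": "CliticSpecificDeterminer",
    "Hungarian": "PersonOfObject",
}


def interpret_definite(language=None):
    """Return the possible readings of Definite for a language.

    Without a known language, all alternatives of the disjunction
    CliticDefiniteDeterminer ⊔ CliticSpecificDeterminer ⊔ PersonOfObject remain.
    """
    if language in DEFINITE_BY_LANGUAGE:
        return {DEFINITE_BY_LANGUAGE[language]}
    return set(DEFINITE_BY_LANGUAGE.values())


print(interpret_definite("Hungarian"))
print(sorted(interpret_definite()))
```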


Redundancy

• MTE tagsets were created bottom-up from existing resources, often unaware of earlier treatment of the same phenomenon

• e.g., reduced (vs. full) adjectives in Slavic

– Czech MTE Formation=nominal,

– Polish MTE Definiteness=short-art

marked by owl:equivalentClass


Definiteness in the MULTEXT-East Ontologies

Definiteness=1s2s (2)
Definiteness=distal (d)
Definiteness=full-art (f)
Definiteness=no (n)
Definiteness=proximal (p)
Definiteness=short-art (s)
Definiteness=yes (y)


Linking to OLiA

• After discussion with the MTE community, the TBox was semiautomatically linked with the OLiA Reference Model

• semiautomatically

– automatically link concepts with the same local name

– suggest linking candidates for concepts with overlapping local names => selection or comment

– comment linking status

• manually revise, check every concept with a comment
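The semiautomatic procedure can be sketched as follows. Here "overlapping local names" is interpreted as sharing at least one CamelCase word, which is an assumption; the function names and the OLiA concept list in the example are illustrative, not the actual implementation or inventory.

```python
import re


def camel_parts(name):
    """Split a CamelCase local name into its words."""
    return re.findall(r"[A-Z][a-z]*", name)


def suggest_links(mte_names, olia_names):
    """Link identical local names; propose candidates for overlapping ones."""
    olia = set(olia_names)
    links, candidates = {}, {}
    for name in mte_names:
        if name in olia:
            links[name] = name  # same local name: link automatically
        else:
            words = set(camel_parts(name))
            # candidates (possibly none) are left for manual selection or comment
            candidates[name] = sorted(
                other for other in olia if words & set(camel_parts(other))
            )
    return links, candidates


links, candidates = suggest_links(
    ["CommonNoun", "CliticDefiniteDeterminer", "PaucalQuantifier"],
    ["CommonNoun", "Determiner", "DefiniteArticle", "Quantifier"],
)
print(links)
print(candidates)
```

Every entry in candidates corresponds to a concept that still has to be checked manually.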


Extending OLiA

• During the semiautomatic linking, several cases came up where no OLiA concept could be found

• NumeralAgreementClass

– SingularQuantifier (agreement pattern like numeral 1)

– DualQuantifier (agreement pattern like numeral 2)

– PaucalQuantifier (agreement pattern for quantities between singular [dual] and plural quantifiers)

– PluralQuantifier (agreement pattern like high numerals)


Practical Results

• developing the ontology
– helped identify inconsistencies
– facilitated dialog between
• an NLP person with limited knowledge about the languages under consideration
• language specialists with different degrees of awareness of the structure of other MTE language models

• Resource can be used for documentation
– using browseable OWL or the generated HTML

• Advanced uses possible


Use case II: Cross-tagset search via query rewriting


Ontology-based query rewriting

• "Sustainability of linguistic resources"
– common terminological interface for querying heterogeneously annotated data

• OntoClient (Rehm et al. 2008)
– Preprocessor for ontology-sensitive corpus queries

• OntoClient@ANNIS 1.0
– ANNIS (http://annis-tools.org/)
• web application for corpus querying
– OntoANNIS (Chiarcos & Goetze 2007)
• ANNIS meets OntoClient


[Figure: query rewriting with OntoClient]

corpus query:

... pos in { Noun \ Nominal} & cat = ...

segmented into: UnparsedString, OntoKey, OntoLPar, OntoConcept, OntoOp, OntoConcept, OntoRPar, UnparsedString

ontology lookup:
1. retrieve instances and tags
2. application of set operators

• Reference Model: Noun, ProperNoun, MassNoun, CountableNoun, CommonNoun, Nominal, VerbalNoun, Substantive
• linking
• Annotation Model: tibet:ProperNoun, tibet:InanimateNoun, tibet:AnimateNoun, tibet:Person, tibet:CommonNoun
• tags: NOM_inan, NOM_anim_lq, NOM_inan_lq, NOM_pers, NOM_pers_anim, NAME, NOM_anim

return modified corpus query:

... pos = NN | pos = NCOM | pos = substantiv_masc_pl_dat_bel | pos = substantiv_masc_pl_akk_unb | pos = substantiv_fem_sg_ins_unb & cat = ...


OntoClient Query Language

Query := (UnparsedString* OntoQuery*)*

OntoQuery := OntoKey OntoLPar OntoExp OntoRPar

OntoKey := "in"

OntoLPar := "{"

OntoRPar := "}"

OntoExp := OntoConcept | (OntoExp OntoOp OntoExp)

OntoOp := "and" | "or" | "without" | "&" | "|" | "\"

OntoConcept := Upper model concept | Upper model relation "(" Upper model concept ")"
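A much reduced rewriter for this grammar might look as follows. It is a sketch, not OntoClient itself: the concept hierarchy, the tag assignments (including the tag NOM_verbal) and all function names are hypothetical, nested parentheses and relation expressions are not supported, and operators are applied strictly from left to right. The output delimiters follow the TIGER-style example ("=", "/", "|").

```python
import re

# Toy ontology; hierarchy and tag assignments are assumptions for illustration.
PARENTS = {
    "CommonNoun": ["Noun"],
    "ProperNoun": ["Noun"],
    "VerbalNoun": ["Noun", "Nominal"],
}
TAGS = {
    "CommonNoun": {"NOM_anim", "NOM_inan"},
    "ProperNoun": {"NAME"},
    "VerbalNoun": {"NOM_verbal"},  # hypothetical tag
}
OPERATORS = {
    "and": set.intersection, "&": set.intersection,
    "or": set.union, "|": set.union,
    "without": set.difference, "\\": set.difference,
}


def is_subsumed(concept, ancestor):
    """True if `concept` equals `ancestor` or is one of its descendants."""
    if concept == ancestor:
        return True
    return any(is_subsumed(p, ancestor) for p in PARENTS.get(concept, []))


def tags_for(concept):
    """Step 1: retrieve all tags linked to a concept or its sub-concepts."""
    return {t for c, tags in TAGS.items() if is_subsumed(c, concept) for t in tags}


def evaluate(expression):
    """Step 2: apply the set operators from left to right."""
    tokens = re.findall(r"[A-Za-z_]+|[&|\\]", expression)
    result = tags_for(tokens[0])
    for operator, concept in zip(tokens[1::2], tokens[2::2]):
        result = OPERATORS[operator](result, tags_for(concept))
    return result


def rewrite(query, key="=", left="/", right="/", disj="|"):
    """Replace every ' in { ... }' part by a disjunction of tags."""
    def expand(match):
        return key + left + disj.join(sorted(evaluate(match.group(1)))) + right
    return re.sub(r"\s+in\s*\{([^}]*)\}", expand, query)


print(rewrite(r"pos in { Noun \ Nominal } & cat = NP"))
```

Everything outside the braces is passed through unchanged, which corresponds to the UnparsedString parts of the grammar.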


OntoClient Interface to Query Language

Input -> Output (e.g. TIGER)

UnparsedString -> UnparsedString
OntoKey -> Key ("=")
OntoLPar -> LeftPar ("/")
OntoExp -> tag (Disj tag)* ("NP1m|NP1c"), with Disj "|"
OntoRPar -> RightPar ("/")
UnparsedString -> UnparsedString


A sample application OntoANNIS

• OntoANNIS made it possible to query across different annotations, but was only a prototype.

With the configurable OntoClient, a similar prototype for CWB was set up, connecting to an annotated edition of the Uppsala Corpus (Russian, 1 million tokens) hosted at Tübingen

• Technology mixture can create a bottleneck

=> motivation to explore a native-SPARQL implementation


Tomorrow

• Second session on annotation interoperability

– SPARQL-native corpus querying

– Ontology-based ensemble combination

• Hands-on session on annotation interoperability

– building annotation models for different language editions of Universal Dependencies

– linking them


For tomorrow …

… please don't forget to download and install Protégé 5.0 (Desktop version)

http://protege.stanford.edu/

for the hands-on session tomorrow

• we'll be building annotation models and linking them


Selected References

Christian Chiarcos (2008). An ontology of linguistic annotations. LDV Forum. 23(1).

Christian Chiarcos (2010). Grounding an ontology of linguistic annotations in the Data Category Registry. LREC 2010 Workshop on Language Resource and Language Technology Standards (LT&LTS), Valletta, Malta.

Christian Chiarcos, Tomaž Erjavec (2011). OWL/DL formalization of the MULTEXT-East morphosyntactic specifications. In Proceedings of the 5th Linguistic Annotation Workshop. Association for Computational Linguistics, 2011.

Scott Farrar, D. Terence Langendoen (2003). A linguistic ontology for the semantic web. Glot International 7.3: 97-100.

John Hughes, Clive Souter, Eric Atwell (1995). Automatic Extraction of Tagset Mappings from Parallel-annotated Corpora. In: From Texts to Tags: Issues in Multilingual Language Analysis. Proceedings of SIGDAT Workshop in Conjunction with the 7th Conference of the European Chapter of the Association for Computational Linguistics. University College Dublin, Ireland.

Ryan McDonald, Joakim Nivre, et al. (2013). Universal Dependency Annotation for Multilingual Parsing. In Proc. ACL-2013, pp. 92-97.

Slav Petrov, Dipanjan Das, Ryan McDonald (2012). A universal part-of-speech tagset. Proc. LREC-2012.

Roland Stuckardt (2001). Design and Enhanced Evaluation of a Robust Anaphor Resolution Algorithm. Computational Linguistics 27(4):479-506.

Sue Ellen Wright (2004). A Global Data Category Registry for Interoperable Language Resources. In Proc. LREC-2004.