Annotation Interoperability
Christian Chiarcos
EUROLAN-2015, 2015, July 21, Sibiu, Romania
Tue, July 21st, 09:00-10:30
Annotation Interoperability I
Ontologies of Linguistic Annotation – Motivations and Principles
Wed, July 22nd, 11:00-12:30
Annotation Interoperability II
Applications and Use Cases
Wed, July 22nd, 14:00-15:30
Annotation Interoperability III
Hands-on session
Annotation Interoperability I: Ontologies of Linguistic Annotation – Motivations and Principles
1. Conceptual Interoperability
2. Towards a modular set of linked ontologies
3. Structure and history of OLiA ontologies
4. Use case I: Documentation and formalization
5. A closer look at an example: MULTEXT-East
6. Use case II: Cross-tagset search via query rewriting
Before we proceed …
… please download and install Protégé 5.0 (Desktop version) during the day
http://protege.stanford.edu/
for the hands-on session tomorrow
• we'll be building annotation models and linking them
Conceptual Interoperability
Problem and earlier approaches
Interoperability
• Language resources (tools, corpora, dictionaries) are interoperable if we can (re-)use them as components of a coherent system/resource
• e.g., tools
– using tagger A with parser B
• with a domain-adapted tagger A* and a general-purpose parser B
* think of POS taggers for the biomedical domain (e.g., Genia), which use different tokenization strategies than out-of-the-box parsers
• e.g., corpora
– run the same query on corpus A and corpus B
=> more data, more likely significant results, comparable results
• e.g., dictionary + tool/corpus
– use dictionary A as a component for tagger B
• if grammatical categories correspond to tags in the tagset that the tagger is trained on
Dimensions of Interoperability
• Structural Interoperability
– use the same format / mode of access
– more on this tomorrow
• Conceptual Interoperability
– use the same vocabularies, e.g., for linguistic annotations
– for the moment, we focus on the most elementary level: morphosyntax
(parts-of-speech, agreement features)
Interoperability Issues: Monolingual
• When language resources for a low-resource language are developed, different people have different ideas, e.g., for English (by the mid-1990s)
Word      Susanne   Penn
The       AT        DT
Fulton    NP1s      NNP
County    NNL1cb    NNP
Grand     JJ        NNP
Jury      NN1c      NNP
said      VVDv      VBD
Friday    NPD1      NNP
• Susanne: 395 tags (word classes, morphological features, syntactic features, lexical classes)
• Penn: 57 tags (word classes, number and degree)
Interoperability Issues: Monolingual
• Integrating both resources allows us to
– apply more wide-scale statistical analyses
– increase training data for supervised POS tagging
– increase test data for unsupervised POS tagging
Interoperability Issues: Multilingual
• with interoperable POS tags used across different languages, …
– we can apply the same unlexicalized NLP tools (e.g., parsers, cf. McDonald et al. 2013)
– we can perform comparative corpus studies
– we simplify multilingual annotation projection
Violations of Interoperability
• ROSANA Anaphor Resolution (Stuckardt 2001)
– required Connexor parser
• a commercial product
• UIMA annotation type systems
– NLP modules using the same annotation types are interoperable, but different groups develop their own type systems, even for the same tools and the same language
Classical solution: Standardization
• Expert Advisory Group on Language Engineering (EAGLES)*
– European standardization project (1993 – 1996)
– further elaborated by MULTEXT-East and ISLE/Parole
• Recommendations for POS tag sets
– derived in a bottom-up manner
– no theoretical specification of tag sets, only identification of commonly used terms
* http://www.ilc.cnr.it/EAGLES96/home.html
Issues with EAGLES
… although linguists agree on the general "common-sense" definitions of categories like proper noun, common noun etc, our analysis of competing tagsets for English corpora shows that these categories are in fact 'fuzzy', and different corpus tagging projects have adopted subtly but significantly different definitions, probably unaware that their analyses are incompatible with those of other linguists …
(Hughes et al. 1995)
* EAGLES is a classical case; our generation is just about to re-invent this wheel with "Universal Dependencies" (http://universaldependencies.github.io/)
Issues with EAGLES, cont'd
• Certain phenomena are hard to group with "major" categories
– quantifiers (all, some, many)
• (pre-)determiner? (all the books)
• pronoun? (all of them)
• number? (all books ~ 25 books)
• adjective? (all books ~ green books), suggested for inflecting languages
– attributive pronouns (his book)
• pronoun?
• determiner?
– adjectival participles (enduring freedom)
• verb?
• adjective?
Issues with EAGLES, cont'd
• Certain phenomena are hard to group with "major" categories
• because "morphosyntax/parts-of-speech" conflates different levels of description, e.g.,
– syntax vs. semantics
• attributive pronoun => determiner vs. pronoun
– syntax vs. morphology
• adjectival participles => adjective or verb
– morphology vs. semantics
• ordinal numbers => adjectives vs. numerals
– homophony vs. linguistically defined categories
• VH for auxiliary have but also have as a main verb
Issues with EAGLES, cont'd
• Certain phenomena are hard to group with "major" categories
• BUT
– standardization towards a meta-tagset implicitly enforces unambiguous classification
• taxonomical/tree-like structure
=> independent decisions by tagset designers lead to incompatibilities, e.g., AUX "auxiliary verb" vs. "potential auxiliary verb"
Issues with EAGLES, cont'd
• EAGLES is prescriptive
– a standard-conformant tagset needs to provide certain categories
• even if not relevant to a language
• e.g.
– Determiner (lacking for most Slavic languages)
– Adjective (lacking for Chinese)
– Noun-Verb distinction (debated for Fijian and Inuktitut)
=> EAGLES is specific to Western European languages
Issues with EAGLES, cont'd
• EAGLES is built in a bottom-up fashion
– if unknown phenomena for novel languages are encountered, they are added as optional (language-specific) features
– existing features may or may not be re-used
• later: some problematic cases from MULTEXT-East
Issues with EAGLES, cont'd
• EAGLES requires a 1:1 mapping* from standard-conformant tagsets
– every language-specific tag is mapped to exactly one EAGLES tag, so that they are equivalent
– given the definitional problems mentioned before, we'd like to express whether a mapping is perfect or imprecise
• or indicate partial overlaps with standard categories
* in fact, tags can be underspecified, so it is a 1:m mapping
Issues with EAGLES, cont'd
• EAGLES provides a fixed level of granularity
– more fine-grained categories are abandoned, e.g., semantic classes in Susanne
• for reasons of practicality, this level of granularity is not the finest possible
=> reductionism
=> many shortcomings of the standardization approach can be addressed by modelling linguistic reference terminology by means of ontologies
Towards an ontology of linguistic terminology
• Goal
– develop and apply an ontology as a terminological backbone of different kinds of linguistic annotation
• Use cases
– overcome differences in task-, domain- or language-specific annotations
– provide unified access to data analysed with heterogeneous terminology
Towards an ontology of linguistic terminology
• Ontology
– conceptualization of a certain domain
• e.g. a taxonomy of linguistic terms
– hierarchically and relationally structured
• OWL2/DL (Web Ontology Language)*
– formal description language for ontologies
– formalizes description logics
• conceptual subsumption (rdfs:subClassOf)
• logical operators (incl. disjunction and negation)
* Web Ontology Language, http://www.w3.org/TR/owl2-overview/
Towards an ontology of linguistic terminology
• Against multiple tag sets
– unified representation of heterogeneous data
• linked to multiple different tag sets
– transparent
• abstraction from tag set specifics
– formal definitions
• based on description logics
Towards an ontology of linguistic terminology
• Against standardisation
– different conceptualizations
• language-specific traditions
• domain-specific conceptualizations
– different granularity
– implicit interpretation
• when mapping annotations to standard terms
Towards a modular set of linked ontologies
Just using a central ontology isn't enough
A joint, extensible terminology repository?
• Differences ... among different language resources and individual system objectives ... lead to variations in data category definitions and data category names.
• The use of uniform data category names and definitions ... contributes to system coherence and enhances the re-usability of data.
(Ide & Romary 2004)
The solution I
General Ontology of Linguistic Description (GOLD)
– ... large amounts of linguistic data on the Web ... from different languages can be automatically searched and compared ...
– ... the data and the various encoding schemes in which they are represented need an explicit semantics.
– ... a data model ... which is consistent with .... the Semantic Web ...
(Farrar & Langendoen 2003)
http://linguistics-ontology.org/gold
The solution II
ISO TC37/SC4 Data Category Registry (ISOcat)
– ... a family of data category standards designed to meet the needs of terminologists and other language experts developing a variety of electronic linguistic resources. ...
– ... to ensure interoperability among these domains ...
– ... with an eye to facilitating ... wide-scale information handling environments such as the Semantic Web ...
(Wright 2004)
http://isocat.org/
• The RELISH project aimed to harmonize GOLD and ISOcat, and brought GOLD-2010 into ISOcat
• unfortunately, this only increased redundancy: 5 types of CommonNoun alongside each other
• RelCat has not materialized yet
https://tla.mpi.nl/relish/
The solution III-VIII
Documentation standards in typology
– EUROTYP (Bakker et al. 1993)
– AUTOTYP (Bickel & Nichols 2002)
– Typological Database System (TDS) ontology (Dimitriadis et al. 2009)
Standardization initiatives and multi-language tagsets
– EAGLES (Leech & Wilson 1996)
– MULTEXT/East (Erjavec 2010)
– Common POS tagset for Indian languages (Baskaran et al. 2008)
– Universal POS tags / Universal Dependencies (Petrov et al. 2012)
Another Problem
Imagine you plan to develop a tool that makes use of a terminology repository.
Which one would you choose?
Maybe it's not even your choice ...
... your clients may have their own preferences ... and different clients may have different preferences
Another Solution
Modular architecture
– Instead of limiting ourselves to one, we may make use of an intermediate representation that links to all of them
– If we want to avoid losing information by replacing annotations with reference categories, the original annotation scheme should be formalized as well
Ontologies of Linguistic Annotation (OLiA) http://purl.org/olia
Structure of OLiA ontologies
and their relation to other terminology repositories
Ontologies of Linguistic Annotation
modular OWL/DL ontologies
– Annotation Models
• annotation scheme
– OLiA Reference Model
• common terminology
– External Reference Models
• existing terminology repositories
OLiA Reference Model
– interface between annotations and (multiple) terminology repositories
[Figure: Annotation Models connected via the OLiA Reference Model to multiple Terminology Repositories]
OLiA Reference Model
• harmonization of repositories of annotation terminology
• morphosyntax & morphology
– 39 schemes
– ~70 languages*
• syntax, discourse structure, anaphora, information structure
* including multilingual annotation schemes: Tapanainen & Järvinen (1997), Dipper et al. (2007), Erjavec (2010)
OLiA Reference Model
[Figure: excerpt of the OLiA Reference Model]
• concepts
– DemonstrativeDeterminer is-a Determiner is-a PronounOrDeterminer is-a MorphosyntacticCategory
– AccusativeCase is-a Case is-a MorphologicalFeature
• properties
– hasCase (relates x : MorphosyntacticCategory to y : Case)
OLiA Annotation Models
• OWL/DL formalizations of annotation schemes
– structure similar to the Reference Model
• individuals represent annotation values
– hasTag property
• string value of annotation
OLiA Annotation Model
[Figure: excerpt of the STTS Annotation Model (STTS: German part-of-speech tags)]
• Adjective is-a POS
• the individuals ADJD and ADJA are instances of Adjective
• ADJA: hasTag "ADJA"
OLiA Linking Model
Annotation model concepts are defined as subclasses of Reference Model concepts
– properties as sub-properties
– individuals as instances
The linking is physically separated from the models
– one possible interpretation of Annotation Model concepts in terms of the Reference Model
OLiA Linking Model
[Figure: linking the STTS Annotation Model with the OLiA Reference Model]
• STTS Annotation Model: Adjective is-a POS; the individuals ADJD and ADJA are instances of Adjective; ADJA: hasTag "ADJA"
• OLiA Reference Model: AttributiveAdjective is-a Adjective is-a MorphosyntacticCategory
• STTS Linking: STTS Adjective is-a OLiA Adjective; ADJA is an instance of OLiA AttributiveAdjective
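The interplay of Annotation Model, Reference Model and Linking Model can be sketched in a few lines of Python. This is only an illustration with plain dictionaries, not the actual OWL files: the prefixes stts: and olia:, the dictionary names and the function reference_concepts are assumptions made for this sketch, and the individual-to-class assignments follow the simplified figure rather than the published ontologies.

```python
# Toy stand-ins for the three separate models (illustrative, not the real OWL files).

# OLiA Reference Model: concept -> direct superconcept (is-a)
REFERENCE_IS_A = {
    "olia:AttributiveAdjective": "olia:Adjective",
    "olia:Adjective": "olia:MorphosyntacticCategory",
}

# STTS Annotation Model: individual -> (hasTag string, class of the individual)
ANNOTATION_MODEL = {
    "stts:ADJA": ("ADJA", "stts:Adjective"),
    "stts:ADJD": ("ADJD", "stts:Adjective"),
}

# STTS Linking Model, kept apart from both models:
# Annotation Model class or individual -> Reference Model concept
LINKING = {
    "stts:Adjective": "olia:Adjective",         # subclass link
    "stts:ADJA": "olia:AttributiveAdjective",   # instance link
}


def reference_concepts(tag):
    """Collect every Reference Model concept that can be inferred for a tag string."""
    concepts = set()
    for individual, (has_tag, cls) in ANNOTATION_MODEL.items():
        if has_tag != tag:
            continue
        # Follow the links of the individual and of its class,
        # then climb the is-a hierarchy of the Reference Model.
        for source in (individual, cls):
            concept = LINKING.get(source)
            while concept is not None:
                concepts.add(concept)
                concept = REFERENCE_IS_A.get(concept)
    return concepts


print(sorted(reference_concepts("ADJA")))
print(sorted(reference_concepts("ADJD")))
```

Because the linking lives in its own table, a different interpretation of STTS in terms of the Reference Model would only require swapping LINKING, which mirrors the physical separation of the Linking Model.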
OLiA: Terminology Repositories
The OLiA Reference Model is further linked to terminology repositories
– if they are modelled in OWL/DL
• GOLD (Chiarcos 2008)
• ISOcat (Chiarcos 2010)
• OntoTag (Buyko et al. 2008)
• TDS (Dimitriadis et al. 2009)
Extensibility
• The OLiA Reference Model provides only one possible view on linguistic terminology; adaptations for other communities are encouraged
• External Reference Model / Terminology Repository
– its concepts are superclasses of OLiA Reference Model concepts
• OLiA can be seen as a GOLD Community of Practice Extension
=> OLiA serves as an interface to different tagsets
=> only one mapping needs to be defined
How to access OLiA ontologies
• modular structure
– every model is an independent ontology in a separate file
=> different name spaces
• declarative linking
– linking model in a separate file
• stts-link.rdf
• to use OLiA, directly import the linking model (and its imports)
[Figure: olia.owl (OLiA Reference Model), stts.owl (Annotation Model STTS) and stts-link.rdf (Linking Model)]
File Structure
File        Model                      Linking Model
olia.owl    OLiA Reference Model
stts.owl    Annotation Model STTS      stts-link.rdf
susa.owl    Annotation Model Susanne   susa-link.rdf
penn.owl    Annotation Model Penn      penn-link.rdf
...
For every Annotation Model, there is at least one Linking Model linking it with the OLiA Reference Model
How to access multiple OLiA ontologies
• Create a master file (all.rdf) which imports the Linking Models (stts-link.rdf, susa-link.rdf, penn-link.rdf, ...) with their imports (stts.owl, susa.owl, penn.owl, ..., olia.owl)
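What loading the master file amounts to can be sketched as a transitive closure over import statements. The table below is an assumption for illustration: the file names come from the slides, but which file imports which (each Linking Model importing its Annotation Model and olia.owl) is a guess at the layout, and import_closure is a hypothetical helper, not part of any OLiA tooling.

```python
# Assumed import graph; file names from the slides, import edges hypothetical.
IMPORTS = {
    "all.rdf": ["stts-link.rdf", "susa-link.rdf", "penn-link.rdf"],
    "stts-link.rdf": ["stts.owl", "olia.owl"],
    "susa-link.rdf": ["susa.owl", "olia.owl"],
    "penn-link.rdf": ["penn.owl", "olia.owl"],
}


def import_closure(root):
    """Return every file reachable from root via imports, each listed once."""
    seen = []
    stack = [root]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # shared imports such as olia.owl are loaded only once
        seen.append(current)
        # reversed() keeps the declared import order in the depth-first walk
        stack.extend(reversed(IMPORTS.get(current, [])))
    return seen


print(import_closure("all.rdf"))
```

Importing a single Linking Model instead of the master file corresponds to calling import_closure("stts-link.rdf"), which under these assumptions pulls in only stts.owl and olia.owl.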
How to access external terminology repositories
• Analogously, external reference models (terminology repositories) can be included
– e.g., GOLD: gold.owl with the Linking Model gold-link.rdf, next to olia.owl (OLiA Reference Model) in the master file all.rdf
• For querying (etc.), one can access external conceptual models
=> simplifications with SPARQL Update
Inferring Conceptual Descriptions
[Figure: conceptual descriptions inferred from Annotation Models via the OLiA Reference Model]
• Further analogous inference of GOLD or ISOcat concepts
=> interoperable with both repositories
Terminology
• Translation from tags to ontological descriptions (triple sets)
– comparable representations for annotations of different origin
=> mapping between tagsets
=> concept-based ensemble combination architecture
=> concept-based corpus querying
(more in a minute)
A brief history of OLiA
Original research context
"Sustainability of Linguistic Data" (2005-2008)
• co-operation project between three German collaborative research centers
– CRC441 "Linguistic data structures" (Tübingen)
– CRC538 "Multilingualism" (Hamburg)
– CRC632 "Information Structure" (Potsdam/Berlin)
• data collections of research projects should be kept available for later research activities
Original use case
• Motivation
– structural differences between different annotations/analyses
• hindering interoperability between concurrent taggers/tag sets
=> reference to a common terminological backbone
Original use case
• Goal
– develop and apply an ontology as a terminological backbone of different kinds of linguistic annotation
• Use cases
– overcome differences in task-, domain- or language-specific annotations
– provide unified access to data analysed with heterogeneous terminology
Developing an ontology: Procedure
• derive a taxonomy of word classes from EAGLES
=> "EAGLES ontology"
• augment with categories from other tag sets
=> "E(xtended)-EAGLES ontology"
• harmonize E-EAGLES ontology with GOLD
– enrichment of structures
– possible revisions of GOLD
=> "E-GOLD ontology"
Developing an ontology: The EAGLES ontology (2005)
• hierarchical interpretation of EAGLES meta tags
– word classes
• noun, verb, adjective, ...
=> top level categories
– recommended features
• common noun vs. proper noun
=> subclasses
– purely inflectional features ignored
• case, definiteness of nouns, mood, etc.
Developing an ontology: The EAGLES ontology (2005)
[Figure: Verb with subclasses FiniteVerb and NonFiniteVerb; NonFiniteVerb with subclasses Infinitive and Participle; edges mark subclass and disjointness relations]
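Developing an ontology: The extended EAGLES ontology (2005)
[Figure: Verb with subclasses FiniteVerb and NonFiniteVerb; NonFiniteVerb with subclasses Infinitive, Participle and AdverbialParticiple; edges mark subclass and disjointness relations]
• AdverbialParticiple: "transgressive" in the CRC441/B1 tagset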
Developing an ontology: E-GOLD (2006)
• use GOLD as a reference ontology
– Number as a sub-class of Quantifier
• suggested additions to GOLD
– CommonNoun vs. ProperNoun
• suggest revisions of GOLD
– She's the one.
• Number ⊑ Quantifier ⊑ Determiner ???
=> Quantifier as top-level category
Developing an ontology: OLiA Reference Model
• extended in accordance with further annotation schemes
• extended for syntax (2007) and discourse (2014, experimental)
• linked to OntoTag (2008), ISOcat (2010), MULTEXT/East (2011), TDS (2012), lexinfo (2015)
• to be linked to Universal Dependencies => hands-on session tomorrow
Conceptual Interoperability
Penn
Word     Tag      Conceptual description
The      DT       Determiner ⊓ PronounOrDeterminer
Fulton   NNP      ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
County   NNP      ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Grand    NNP      ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Jury     NNP      ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
said     VBD      (MainVerb ⊔ StrictAuxiliaryVerb) ⊓ Verb ⊓ ∃hasTense.Past [sic!]
Friday   NNP      ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular

Susanne
Word     Tag      Conceptual description
The      AT       DefiniteArticle ⊓ Article ⊓ Determiner ⊓ PronounOrDeterminer
Fulton   NP1s     Surname ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
County   NNL1cb   TopographicalNoun ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular
Grand    JJ       Adjective ⊓ ∃hasDegree.Positive
Jury     NN1c     CommonNoun ⊓ Noun ⊓ ∃hasNumber.Singular
said     VVDv     MainVerb ⊓ Verb ⊓ ∃hasTense.Past
Friday   NPD1     TemporalNoun ⊓ ProperNoun ⊓ Noun ⊓ ∃hasNumber.Singular

=> mostly identical triples, just a few more from Susanne
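The comparison of both annotations can be made concrete by treating each conceptual description as a set of atoms and intersecting them per token. This is a simplified sketch: the conjuncts are copied from the descriptions of the example sentence, the disjunction (MainVerb ⊔ StrictAuxiliaryVerb) of VBD is deliberately left out because a plain set cannot express it, and the names PENN, SUSANNE, TOKENS and shared are invented for this illustration.

```python
# Conjuncts of the conceptual descriptions, one set per tag.
# Simplification: the disjunction in the VBD description is omitted.
PENN = {
    "DT": {"Determiner", "PronounOrDeterminer"},
    "NNP": {"ProperNoun", "Noun", "hasNumber.Singular"},
    "VBD": {"Verb", "hasTense.Past"},
}
SUSANNE = {
    "AT": {"DefiniteArticle", "Article", "Determiner", "PronounOrDeterminer"},
    "NP1s": {"Surname", "ProperNoun", "Noun", "hasNumber.Singular"},
    "NNL1cb": {"TopographicalNoun", "ProperNoun", "Noun", "hasNumber.Singular"},
    "JJ": {"Adjective", "hasDegree.Positive"},
    "NN1c": {"CommonNoun", "Noun", "hasNumber.Singular"},
    "VVDv": {"MainVerb", "Verb", "hasTense.Past"},
    "NPD1": {"TemporalNoun", "ProperNoun", "Noun", "hasNumber.Singular"},
}
# (word, Penn tag, Susanne tag)
TOKENS = [
    ("The", "DT", "AT"), ("Fulton", "NNP", "NP1s"), ("County", "NNP", "NNL1cb"),
    ("Grand", "NNP", "JJ"), ("Jury", "NNP", "NN1c"), ("said", "VBD", "VVDv"),
    ("Friday", "NNP", "NPD1"),
]


def shared(penn_tag, susanne_tag):
    """Atoms on which both annotations of a token agree."""
    return PENN[penn_tag] & SUSANNE[susanne_tag]


for word, penn_tag, susanne_tag in TOKENS:
    extra = SUSANNE[susanne_tag] - PENN[penn_tag]
    print(word, sorted(shared(penn_tag, susanne_tag)), "Susanne only:", sorted(extra))
```

Tokens such as "Grand" (NNP vs. JJ) come out with an empty intersection, which is the kind of genuine disagreement between the two schemes that a concept-based comparison makes visible.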
Limitations
• The OLiA Reference Model is not fully axiomatized
• This is not possible in a language-independent way
– "only Nouns, Pronouns, Determiners and Adjectives have Gender agreement"?
• But what about Slavic verbs in past tense?
– "only FiniteVerbs have Tense"?
• past and present participles
• tensed infinitives in Old Norse and Ancient Greek
– "Adverbs don't agree"?
• German meinetwegen, deinetwegen, seinetwegen
– "Nouns are no finite Verbs"?
• Inuktitut
qimiutuq "(he) has a dog" (v.3s.vp.) = "dog-owner" (n.abs)
qimiutup "he has a dog" (vpart.) = "dog-owner" (n.erg)
• no disjointness and cardinality axioms
– need to be defined in a language-specific way
=> can only be heuristically extrapolated from annotations
Linking the ontology
• applies Linked Open Data (LOD) principles to the relation between tagsets
• used as a vocabulary
– NLP Interchange Format, Apache Stanbol (for linguistic annotations)
– lemon (machine-readable dictionaries)
• linked with bibliographical data
– Virtuelle Fachbibliothek Allgemeine Sprachwissenschaft (2015-2016)
Original Use Case
• Tagset formalization
– formal definitions
– uniformly laid-out HTML, automatically generated from an ontology
• Advanced use cases
– ontology-based corpus querying
– ontology-based NLP applications
A closer look at an example: MULTEXT-East
Chiarcos & Erjavec (2011)
MULTEXT-East
• Corpus and dictionary project (Veronis & Ide 2004, Erjavec 2010)
• Idea: Extend EAGLES to Eastern Europe
• Parallel "1984" corpus plus morphosyntactically annotated dictionaries
– English
– Slavic (Bulgarian, Croatian, Czech, Macedonian, Polish, Resian, Russian, Serbian, Slovak, Slovene, Ukrainian)
– Finno-Ugric (Estonian, Hungarian)
– Romanian
– Persian
Building the MULTEXT-East Ontology
• Annotation guidelines in TEI/XML
• Automatically converted to OWL2/DL using XSLT
– common specifications as TBox
– language-specific specifications as ABoxes importing the TBox
• discussed with MULTEXT-East users and maintainers
– manually revised
• common specifications semiautomatically linked with OLiA Reference Model
– OLiA Reference Model manually extended
http://nl.ijs.si/ME/owl/
MULTEXT-East Morphosyntactic Descriptions
Multiple documents
• common specifications
• language-specific
=> provides all values used in MULTEXT-East corpora/dictionaries (language-specific similar)
MULTEXT-East Common Specifications
• categories become top-level concepts
• Fine-grained parts of speech are encoded as features (~ EAGLES)
– e.g., Noun, Type=common (Nc)
– converted into sub-concepts
• choice followed OLiA Reference Model
• other features are encoded as object properties plus associated feature concept
MULTEXT-East Language Specifics
• Stored in separate document
– No hierarchical structure inferred, import common specifications
• Add tags as individuals
= instances of concepts and features, with tag value and object properties to itself
Observations
• Like EAGLES, MTE uses a positional tagset
=> bias against adding new attributes
=> systematic overload (of attributes and values)
• A manual revision was thus unavoidable
Manual Revision
• Adjust automatically generated names
– CorrelatCoordConjunction (< Coord, Type=correlat) expanded to CorrelativeCoordinatingConjunction
– YesDefiniteness (< Definiteness=yes) simplified to Definite
Manual Revision
• Manual hierarchical reanalysis of (some) feature values
CliticProximalDeterminer ⊑ CliticDefiniteDeterminer
(could be presented as a flat list in MTE only)
Resolving Attribute Overload
• one attribute groups together unrelated phenomena from different languages
• Definiteness =>
– CliticDeterminerType (presence of a postfixed article on Romanian, Bulgarian and Persian nouns and adjectives)
– ReductionFeature (full and reduced adjectives in many Slavic languages)
– PersonOfObject (the so-called 'definite conjugation' of Hungarian verbs)
Documenting Value Overload
• one attribute value groups together unrelated phenomena from different languages
• Definiteness=yes (=> Definite), i.e.,
– clitic definite determiner (CliticDeterminerType in Rom. and Bulg.)
– clitic specific determiner (CliticDeterminerType in Persian)
– verb with a definite 3rd-person direct object (PersonOfObject in Hungarian)
Definite ⊑ CliticDefiniteDeterminer ⊔ CliticSpecificDeterminer ⊔ PersonOfObject
• In addition, add a concept as "anchor" for such ambiguous features
Definite ⊑ AmbiguousDefinitenessFeature
Redundancy
• MTE tagsets were created bottom-up from existing resources, often unaware of earlier treatment of the same phenomenon
• e.g., reduced (vs. full) adjectives in Slavic
– Czech MTE Formation=nominal,
– Polish MTE Definiteness=short-art
=> marked by owl:equivalentClass
Definiteness in the MULTEXT-East Ontologies
– Definiteness=1s2s (2)
– Definiteness=distal (d)
– Definiteness=full-art (f)
– Definiteness=no (n)
– Definiteness=proximal (p)
– Definiteness=short-art (s)
– Definiteness=yes (y)
Linking to OLiA
• After discussion with the MTE community, the TBox was semiautomatically linked with the OLiA Reference Model
• semiautomatic linking
– automatically link concepts with the same local name
– suggest linking candidates for concepts with overlapping local names => selection or comment
– comment linking status
• manually revise, check every concept with a comment
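The first two steps of the semiautomatic linking can be approximated with simple string operations. The sketch below is an assumption about how such a heuristic might look, not the procedure actually used for MULTEXT-East: "overlapping local names" is interpreted here as sharing at least one CamelCase word, and the functions name_parts and propose_links as well as the sample concept lists are made up for illustration.

```python
import re


def name_parts(local_name):
    """Split a CamelCase local name into a set of lower-cased words."""
    return {part.lower() for part in re.findall(r"[A-Z][a-z0-9]*", local_name)}


def propose_links(annotation_concepts, reference_concepts):
    """Return (automatic links, candidates for manual selection, unlinked concepts)."""
    reference = set(reference_concepts)
    links, candidates, unlinked = {}, {}, []
    for concept in annotation_concepts:
        if concept in reference:
            links[concept] = concept  # same local name: link automatically
            continue
        overlapping = sorted(
            ref for ref in reference if name_parts(concept) & name_parts(ref)
        )
        if overlapping:
            candidates[concept] = overlapping  # left to a human for selection or comment
        else:
            unlinked.append(concept)  # no counterpart found in the Reference Model
    return links, candidates, unlinked


links, candidates, unlinked = propose_links(
    ["Noun", "CliticDefiniteDeterminer", "PaucalQuantifier"],
    ["Noun", "Determiner", "DemonstrativeDeterminer", "Adjective"],
)
print(links, candidates, unlinked)
```

Concepts that end up in the unlinked list are the cases where the Reference Model itself may need to be extended, and every candidate suggestion still requires the manual check described above.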
Extending OLiA
• During the semiautomatic linking, several cases came up where no OLiA concept could be found
• NumeralAgreementClass
– SingularQuantifier (agreement pattern like numeral 1)
– DualQuantifier (agreement pattern like numeral 2)
– PaucalQuantifier (agreement pattern for quantities between singular [dual] and plural quantifiers)
– PluralQuantifier (agreement pattern like high numerals)
Practical Results
• developing the ontology helped identify inconsistencies and facilitated dialog between
– an NLP person with limited knowledge about the languages under consideration
– language specialists with different degrees of awareness of the structure of other MTE language models
• Resource can be used for documentation
– using browseable OWL or the generated HTML
• Advanced uses possible
Use case II: Cross-tagset search
via query rewriting
Ontology-based query rewriting
• "Sustainability of linguistic resources"
– common terminological interface for querying heterogeneously annotated data
• OntoClient (Rehm et al. 2008)
– Preprocessor for ontology-sensitive corpus queries
• OntoClient@ANNIS 1.0
– ANNIS (http://annis-tools.org/)
• web application for corpus querying
– OntoANNIS (Chiarcos & Goetze 2007)
• ANNIS meets OntoClient
Ontology-based query rewriting
1. corpus query
... pos in { Noun \ Nominal} & cat = ...
– segmented as: UnparsedString, OntoKey, OntoLeftPar, OntoConcept, OntoOp, OntoConcept, OntoRightPar, UnparsedString
2. ontology lookup
– retrieve instances and tags
– application of set operators
3. return modified corpus query
... pos = NN | pos = NCOM | pos = substantiv_masc_pl_dat_bel | pos = substantiv_masc_pl_akk_unb | pos = substantiv_fem_sg_ins_unb & cat = ...
[Figure: Reference Model concepts (Noun, ProperNoun, CommonNoun, MassNoun, CountableNoun, Nominal, VerbalNoun, Substantive) linked to Annotation Model concepts (tibet:ProperNoun, tibet:CommonNoun, tibet:InanimateNoun, tibet:AnimateNoun, tibet:Person) with the tags NAME, NOM_inan, NOM_inan_lq, NOM_anim, NOM_anim_lq, NOM_pers, NOM_pers_anim]
OntoClient Query Language
Query := (UnparsedString* OntoQuery*)*
OntoQuery := OntoKey OntoLPar OntoExp OntoRPar
OntoKey := "in"
OntoLPar := "{"
OntoRPar := "}"
OntoExp := OntoConcept | (OntoExp OntoOp OntoExp)
OntoOp := "and" | "or" | "without" | "&" | "|" | "\"
OntoConcept := Upper model concept | Upper model relation "(" Upper model concept ")"
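The rewriting step can be illustrated with a small Python sketch. It is not the OntoClient implementation: it handles only flat expressions that are evaluated left to right (no nested parentheses, no relation "(" concept ")" form), it looks concepts up in a hard-coded table instead of an ontology, and the table CONCEPT_TAGS with its tags NN, NE and NOMV is invented for the example.

```python
import re

# Hypothetical lookup result: Reference Model concept -> tags of one Annotation Model.
CONCEPT_TAGS = {
    "Noun": {"NN", "NE", "NOMV"},
    "Nominal": {"NOMV"},
    "ProperNoun": {"NE"},
}

# Set operators of the query language, in word and symbol form.
OPERATORS = {
    "and": set.intersection, "&": set.intersection,
    "or": set.union, "|": set.union,
    "without": set.difference, "\\": set.difference,
}

# attribute, OntoKey "in", then a brace-delimited expression
ONTO_QUERY = re.compile(r"(\w+)\s+in\s*\{([^{}]*)\}")


def evaluate(expression):
    """Evaluate a flat, whitespace-separated expression from left to right."""
    tokens = expression.split()
    result = set(CONCEPT_TAGS[tokens[0]])
    for operator, concept in zip(tokens[1::2], tokens[2::2]):
        result = OPERATORS[operator](result, CONCEPT_TAGS[concept])
    return result


def rewrite(query):
    """Replace every 'attr in { ... }' part by a disjunction of concrete tags."""
    def expand(match):
        attribute, expression = match.group(1), match.group(2)
        tags = sorted(evaluate(expression))
        return " | ".join(f"{attribute} = {tag}" for tag in tags)

    # Everything outside the matched parts is passed through unparsed.
    return ONTO_QUERY.sub(expand, query)


print(rewrite("pos in { Noun \\ Nominal } & cat = NP"))
```

Keeping the unparsed parts of the query untouched is what makes such a preprocessor independent of the underlying corpus query language; only the output format of the disjunction would need to be configured per backend.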
OntoClient Interface to Query Language
Input              Output (e.g. TIGER)
UnparsedString     UnparsedString
OntoKey            Key: "="
OntoLPar           LeftPar: "/"
OntoExp            tag (Disj tag)*, e.g. "NP1m|NP1c", with Disj: "|"
OntoRPar           RightPar: "/"
UnparsedString     UnparsedString
A sample application: OntoANNIS
• OntoANNIS allowed querying across different annotations, but was a prototype only.
• With the configurable OntoClient, a similar prototype for CWB was set up, connecting to an annotated edition of the Uppsala Corpus (Russian, 1 million tokens) hosted at Tübingen
• Technology mixture can create a bottleneck
=> Motivation to explore a native-SPARQL implementation
Tomorrow
• Second session on annotation interoperability
– SPARQL-native corpus querying
– Ontology-based ensemble combination
• Hands-on session on annotation interoperability
– building annotation models for different language editions of Universal Dependencies
– linking them
For tomorrow …
… please don't forget to download and install Protégé 5.0 (Desktop version)
http://protege.stanford.edu/
for the hands-on session tomorrow
• we'll be building annotation models and linking them
Selected References
Christian Chiarcos (2008). An ontology of linguistic annotations. LDV Forum. 23(1).
Christian Chiarcos (2010). Grounding an ontology of linguistic annotations in the Data Category Registry. LREC 2010 Workshop on Language Resource and Language Technology Standards (LT<S), Valletta, Malta.
Christian Chiarcos, Tomaž Erjavec (2011). OWL/DL formalization of the MULTEXT-East morphosyntactic specifications. In Proceedings of the 5th Linguistic Annotation Workshop. Association for Computational Linguistics, 2011.
Scott Farrar, D. Terence Langendoen (2003). A linguistic ontology for the semantic web. Glot International 7.3: 97-100.
John Hughes, Clive Souter, Eric Atwell (1995). Automatic Extraction of Tagset Mappings from Parallel-annotated Corpora. In: From Texts to Tags: Issues in Multilingual Language Analysis. Proceedings of SIGDAT Workshop in Conjunction with the 7th Conference of the European Chapter of the Association for Computational Linguistics. University College Dublin, Ireland.
Ryan McDonald, Joakim Nivre, et al. (2013). Universal Dependency Annotation for Multilingual Parsing. In Proc. ACL-2013, pp. 92-97.
Slav Petrov, Dipanjan Das, Ryan McDonald (2012). A universal part-of-speech tagset. Proc. LREC-2012.
Roland Stuckardt (2001). Design and Enhanced Evaluation of a Robust Anaphor Resolution Algorithm. Computational Linguistics 27(4):479-506
Sue Ellen Wright (2004). A Global Data Category Registry for Interoperable Language Resources. In Proc. LREC-2004.